Heuristics in Latent Space: VAEs and Bayesian Inference

pytorch
variational auto-encoders
amortized bayesian inference
Published

July 25, 2025

Abstract
The cost of collecting new sample data can be prohibitive. A secondary, different cost attaches to the ‘construction’ of novel data from a fitted model. Is that cost worth paying?
Transparency and Heuristics in Latent Space

Principal Components Analysis can be seen as a technique for optimally reconstructing a complex multivariate data set from a compressed, lower-dimensional space. Variational auto-encoders let us achieve still more flexible reconstructions in non-linear cases. Drawing a new sample from the posterior predictive distribution of a Bayesian model similarly gives us insight into the variability of realised data. Both methods assume a latent model of the data-generating process that leverages a compressed representation of the data. These are different heuristics with different consequences for how we understand variability in the world. Each encodes a different degree of inductive bias through its architectural specification, but of the two, only Bayesian methods do so transparently.

Reconstruction Error and Latent Representations

In applied statistical modeling, the hardest problems are often those we can’t directly observe. Inferring latent structure from noisy, incomplete data forces us to ask: what do we believe about the world, and how do our models reflect those beliefs? Can they reproduce plausible patterns? What latent representation in the model space determines our predictions?

This essay explores how different modeling choices affect our ability to reconstruct patterns in observed data. Variational auto-encoder models are compared with structured Bayesian inference on how they handle reconstruction, imputation, and inductive bias. Along the way, we confront a central difficulty: in high-dimensional latent spaces, it’s easy to be wrong in convincing ways. Learning to “fill in the blanks” is only half the battle - the real challenge is to retain high fidelity between the reconstruction process and the data-generating process. Superficial similarity is not enough.

What we cover:

  • PCA and SVD: linear decompositions and reconstruction of complex multivariate data sets.

  • Marginal reconstruction with VAEs

    • How well do deep generative models recover observed variables when trained on complete data?
  • Mask-aware VAEs for missing data

    • Can we explicitly teach VAEs to ignore the gaps—and does it help?
  • Metric vs. structure recovery

    • Why good marginal fits can still hide deep structural errors.
  • Simple strategies, surprising results

    • When mean imputation and simple architectures outperform smarter models.
  • Bayesian inference as structured inductive bias

    • How priors over covariance structures yield robust estimates even with small samples.
  • Implicit imputation via the posterior

    • How the Bayesian model naturally fills in missing data, respecting uncertainty.

Throughout, we learn that reconstruction alone is not validation, and that model structure matters more than model complexity. Inductive bias isn’t a limitation—it’s our strongest tool for generalization.

import torch
import torchvision.datasets as dsets
import pandas as pd
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import pymc as pm 
import arviz as az

import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

Job Satisfaction Data

Job satisfaction is not one thing. It’s a bundle: how we think, how we feel, how we work. In our simulation, we model this bundle through twelve indicators—grouped into constructive thought strategies (JW1–JW3), dysfunctional patterns (UF1, UF2, FOR), well-being (DA1–DA3), and satisfaction outcomes (EBA, ST, MI). The Python code generates data by first defining a realistic correlation matrix and scaling it by standard deviations to yield a structured covariance matrix. From this, we draw synthetic observations using a multivariate normal distribution. This allows us to test estimators under controlled complexity.

Code
# Standard deviations
stds = np.array([0.939, 1.017, 0.937, 0.562, 0.760, 0.524, 
                 0.585, 0.609, 0.731, 0.711, 1.124, 1.001])

n = len(stds)

# Lower triangular correlation values as a flat list
corr_values = [
    1.000,
    .668, 1.000,
    .635, .599, 1.000,
    .263, .261, .164, 1.000,
    .290, .315, .247, .486, 1.000,
    .207, .245, .231, .251, .449, 1.000,
   -.206, -.182, -.195, -.309, -.266, -.142, 1.000,
   -.280, -.241, -.238, -.344, -.305, -.230,  .753, 1.000,
   -.258, -.244, -.185, -.255, -.255, -.215,  .554,  .587, 1.000,
    .080,  .096,  .094, -.017,  .151,  .141, -.074, -.111,  .016, 1.000,
    .061,  .028, -.035, -.058, -.051, -.003, -.040, -.040, -.018,  .284, 1.000,
    .113,  .174,  .059,  .063,  .138,  .044, -.119, -.073, -.084,  .563,  .379, 1.000
]

# Fill correlation matrix
corr_matrix = np.zeros((n, n))
idx = 0
for i in range(n):
    for j in range(i+1):
        corr_matrix[i, j] = corr_values[idx]
        corr_matrix[j, i] = corr_values[idx]
        idx += 1

# Covariance matrix: Sigma = D * R * D
cov_matrix = np.outer(stds, stds) * corr_matrix
#cov_matrix_test = np.dot(np.dot(np.diag(stds), corr_matrix), np.diag(stds))
FEATURE_COLUMNS=["JW1","JW2","JW3", "UF1","UF2","FOR", "DA1","DA2","DA3", "EBA","ST","MI"]
corr_df = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS)

cov_df = pd.DataFrame(cov_matrix, columns=FEATURE_COLUMNS)

def make_sample(cov_matrix, size, columns, missing_frac=0.0, impute=False):
    # Draw synthetic observations from a zero-mean multivariate normal
    values = np.random.multivariate_normal(np.zeros(len(columns)), cov_matrix, size=size)
    if missing_frac > 0.0:
        total_values = values.size
        num_nans = int(total_values * missing_frac)

        # Choose random flat indices
        nan_indices = np.random.choice(total_values, num_nans, replace=False)

        # Convert flat indices to (row, col) and set those values to NaN
        rows, cols = np.unravel_index(nan_indices, values.shape)
        values[rows, cols] = np.nan

    sample_df = pd.DataFrame(values, columns=columns)

    if impute:
        # Mean imputation: replace each NaN with its column mean
        sample_df.fillna(sample_df.mean(), inplace=True)

    return sample_df

sample_df = make_sample(cov_matrix, 263, FEATURE_COLUMNS)
sample_df.head(10)
JW1 JW2 JW3 UF1 UF2 FOR DA1 DA2 DA3 EBA ST MI
0 0.532182 0.268517 0.826548 0.899359 0.671467 0.836727 -0.707105 -0.687392 -0.033037 0.693069 1.493902 -0.020099
1 -0.391045 0.121261 -0.566634 0.642383 1.106285 0.731170 0.729951 0.889734 0.277911 0.804130 0.075732 1.646967
2 0.088972 0.478532 1.139142 0.114234 0.763116 -0.322672 -0.364538 0.357924 -0.602773 0.935978 1.223607 2.687337
3 0.550635 0.163589 -0.595279 -0.007784 -0.647865 0.671655 0.169744 -0.216028 0.281505 -0.511853 0.346339 0.551167
4 -1.835692 -0.664557 -1.304166 0.435669 -0.543643 -0.983968 -0.060496 -0.059377 0.067324 -0.514476 1.247916 -0.827021
5 -0.135666 -0.611336 -0.772474 0.327327 0.643752 -0.737789 -0.338781 -0.377258 -0.651957 -0.547149 -1.064167 -0.351964
6 -0.771523 0.184151 0.053396 0.426041 0.273997 0.087570 -0.319294 -0.777116 -0.838575 -0.427740 0.813530 0.763030
7 0.339776 0.371447 0.073694 0.514497 1.612967 0.413558 -0.633055 -0.081734 -1.696538 0.148815 -1.414748 0.787851
8 -1.635725 -0.075152 -0.442969 -0.542550 0.425583 0.658644 -0.032703 0.293563 0.981798 0.076522 0.113038 -0.151518
9 0.502535 0.047582 0.361115 0.051990 -0.069401 0.901829 0.240820 0.379505 -0.778813 0.186586 0.028588 -0.953463
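The covariance construction used above, \(\Sigma = D R D\) with \(D = \mathrm{diag}(\text{stds})\), can be sanity-checked against the elementwise form used in the code. A minimal sketch with a hypothetical 3x3 example:

```python
import numpy as np

# Scaling a correlation matrix R elementwise by the outer product of the
# standard deviations equals the matrix product D @ R @ D, D = diag(stds).
stds = np.array([0.5, 1.0, 2.0])
R = np.array([[ 1.0, 0.3, -0.2],
              [ 0.3, 1.0,  0.5],
              [-0.2, 0.5,  1.0]])

Sigma_outer = np.outer(stds, stds) * R                  # elementwise form
Sigma_diag = np.diag(stds) @ R @ np.diag(stds)          # matrix-product form

print(np.allclose(Sigma_outer, Sigma_diag))  # True
```

The elementwise form avoids constructing the diagonal matrices explicitly, which is why it is used in the data-generation code.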

A first test of any model for the job satisfaction process is whether it recovers the observed correlation structure. The code extracts pairwise correlations from the simulated data and visualizes them using a heatmap. Strong ties—positive or negative—appear as vivid blocks, offering a visual audit of relational intensity. This pattern matters. While it does not imply causation, it does offer a benchmark: if a model can’t replicate these patterns, it’s likely missing structure.

Code
data = sample_df.corr()

def plot_heatmap(data, title="Correlation Matrix", vmin=-.2, vmax=.2, ax=None, figsize=(10, 6), colorbar=True):
    data_matrix = data.values
    if ax is None:
        fig, ax = plt.subplots(figsize=figsize)
    im = ax.imshow(data_matrix, cmap='magma', vmin=vmin, vmax=vmax)

    # Annotate each cell, switching text colour for contrast
    for i in range(data_matrix.shape[0]):
        for j in range(data_matrix.shape[1]):
            ax.text(
                j, i,                             # x, y coordinates
                f"{data_matrix[i, j]:.2f}",       # text to display
                ha="center", va="center",         # center alignment
                color="white" if data_matrix[i, j] < 0.5 else "black"
            )

    ax.set_title(title)
    # Set tick positions before labels so they align correctly
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_xticklabels(data.columns)
    ax.set_yticks(np.arange(data.shape[0]))
    ax.set_yticklabels(data.index)
    if colorbar:
        plt.colorbar(im)

plot_heatmap(data, vmin=-1, vmax=1)

Singular Value Decomposition and Reconstruction Error

Multivariate systems often lie on lower-dimensional structures. Principal Component Analysis (PCA), via Singular Value Decomposition (SVD), helps us explore this. Any data matrix \(X\) can be decomposed as \(X = U\Sigma V^{T}\) where \(\Sigma\) contains singular values ordered by importance. Truncating this expansion at rank \(k\) reconstructs an approximation of the original. The code below shows that even with only 2 or 5 components, much of the structure is retained. As we increase \(k\) the reconstruction error falls—revealing the data’s latent geometry.

X = make_sample(cov_matrix, 100, columns=FEATURE_COLUMNS)
U, S, VT = np.linalg.svd(X, full_matrices=False)

SVD decomposes a matrix into interpretable parts: directions of variation, scaled by importance. The reconstruction plots show that a handful of components (principal axes) capture most of the variation in job satisfaction data. This supports the idea of latent representations: the surface complexity of our variables arises from simpler underlying factors. Variational autoencoders (VAEs) and Bayesian posterior predictive sampling attempt the same compression and recovery, but embed it within generative models. As we’ll see, their reconstruction quality also reflects the strength and focus of their latent encoding.

Code
ranks = [2, 5, 12]
reconstructions = []
for k in ranks:
    X_k = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]
    reconstructions.append(X_k)

# Plot original and reconstructed matrices
fig, axes = plt.subplots(1, len(ranks) + 1, figsize=(10,15))
axes[0].imshow(X, cmap='viridis')
axes[0].set_title("Original")
axes[0].axis("off")

for ax, k, X_k in zip(axes[1:], ranks, reconstructions):
    ax.imshow(X_k, cmap='viridis')
    ax.set_title(f"Rank {k}")
    ax.axis("off")

plt.suptitle("Reconstruction of Data Using SVD \n various truncation options",fontsize=12, x=.5, y=1.01)
plt.tight_layout()
plt.show()
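The claim that reconstruction error falls as the truncation rank \(k\) grows can be made concrete. A minimal sketch, using synthetic standard-normal data as a stand-in for the job satisfaction sample:

```python
import numpy as np

# Relative Frobenius reconstruction error of rank-k truncated SVD
rng = np.random.default_rng(42)
X = rng.multivariate_normal(np.zeros(12), np.eye(12), size=100)

U, S, VT = np.linalg.svd(X, full_matrices=False)

errors = []
for k in [2, 5, 12]:
    X_k = U[:, :k] @ np.diag(S[:k]) @ VT[:k, :]          # rank-k reconstruction
    errors.append(np.linalg.norm(X - X_k) / np.linalg.norm(X))

print(errors)  # monotonically decreasing; essentially zero at full rank k=12
```

The error at full rank is zero up to floating-point precision, since keeping all singular values reproduces \(X\) exactly.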

SVD shows us that high-dimensional data can often be compressed with little loss. But it’s linear and fixed: it doesn’t learn. Variational autoencoders generalize this idea. Like PCA, they seek a latent representation, but they do so through non-linear mappings learned from data. An autoencoder has two parts: an encoder, which compresses data into a latent code, and a decoder, which reconstructs it. Unlike PCA, this process is flexible and adapts to the data distribution, not just its linear axes. The variational auto-encoder (VAE) extends this idea by placing probability distributions over these latent codes, allowing us to sample and generate new data points.

Variational Auto-Encoders

The NumericVAE class captures this structure. The encoder maps observed data to the parameters of a distribution over latent variables \(z\), specifically a Gaussian with learned mean and variance: \(q(z|x) = N(\mu(x), \sigma^{2}(x))\). Using the reparameterization trick, we sample \(z \sim q(z|x)\) and decode it into a predicted distribution over the observed variables \(p(x|z)\). This makes the model generative: not only can it reconstruct known data, it can synthesize new, plausible examples by sampling from the latent prior \(p(z) = N(0, I)\) and decoding. The result is a flexible, probabilistic architecture that mimics and extends the SVD idea of structure through compression.

class NumericVAE(nn.Module):
    def __init__(self, n_features, hidden_dim=64, latent_dim=8):
        super().__init__()
        
        # ---------- ENCODER ----------
        # First layer: compress input features into a hidden representation
        self.fc1 = nn.Linear(n_features, hidden_dim)
        
        # Latent space parameters (q(z|x)): mean and log-variance
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # μ(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log(σ^2(x))
        
        # ---------- DECODER ----------
        # First layer: map latent variable z back into hidden representation
        self.fc2 = nn.Linear(latent_dim, hidden_dim)
        
        # Output distribution parameters for reconstruction p(x|z)
        # For numeric data, we predict both mean and log-variance per feature
        self.fc_out_mu = nn.Linear(hidden_dim, n_features)        # μ_x(z)
        self.fc_out_logvar = nn.Linear(hidden_dim, n_features)    # log(σ^2_x(z))

    # ENCODER forward pass: input x -> latent mean, log-variance
    def encode(self, x):
        h = F.relu(self.fc1(x))       # Hidden layer with ReLU
        mu = self.fc_mu(h)            # Latent mean vector
        logvar = self.fc_logvar(h)    # Latent log-variance vector
        return mu, logvar

    # Reparameterization trick: sample z = μ + σ * ε  (ε ~ N(0,1))
    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)   # σ = exp(0.5 * logvar)
        eps = torch.randn_like(std)     # ε ~ N(0, I)
        return mu + eps * std           # z = μ + σ * ε

    # DECODER forward pass: latent z -> reconstructed mean, log-variance
    def decode(self, z):
        h = F.relu(self.fc2(z))             # Hidden layer with ReLU
        recon_mu = self.fc_out_mu(h)        # Mean of reconstructed features
        recon_logvar = self.fc_out_logvar(h)# Log-variance of reconstructed features
        return recon_mu, recon_logvar

    # Full forward pass: input x -> reconstructed (mean, logvar), latent params
    def forward(self, x):
        mu, logvar = self.encode(x)            # q(z|x)
        z = self.reparameterize(mu, logvar)    # Sample z from q(z|x)
        recon_mu, recon_logvar = self.decode(z)# p(x|z)
        return (recon_mu, recon_logvar), mu, logvar

    # Sample new synthetic data: z ~ N(0,I), decode to x
    def generate(self, n_samples=100):
        self.eval()
        with torch.no_grad():
            # Sample z from standard normal prior
            z = torch.randn(n_samples, self.fc_mu.out_features)
            
            # Decode to get reconstruction distribution parameters
            cont_mu, cont_logvar = self.decode(z)
            
            # Sample from reconstructed Gaussian: μ_x + σ_x * ε
            return cont_mu + torch.exp(0.5 * cont_logvar) * torch.randn_like(cont_mu)

This generative ability connects VAEs to Bayesian modeling. In Bayesian inference, posterior predictive sampling draws new observations by integrating over uncertainty in parameters. VAEs achieve something similar by sampling from a learned latent distribution and propagating it through the decoder. Both approaches simulate data from a model conditioned on what has been learned. But where Bayesian models emphasize interpretability and prior structure, VAEs emphasize flexibility and scalability with “deep” latent structure. In the next section, we use this NumericVAE to see how well it reconstructs and generates data compared to posterior predictive draws from a Bayesian multivariate normal model.

Variational autoencoders are trained by pairing their generative structure with a principled loss. The encoder maps input data to a distribution over latent variables; the decoder maps sampled latents back to data. This two-step process is trained end-to-end by minimizing a loss that balances two goals: accurate reconstruction and regularization of the latent space. The vae_loss function captures this trade-off. It computes a reconstruction loss using the negative log-likelihood of a Gaussian, plus a penalty, the Kullback-Leibler (KL) divergence, between the approximate posterior \(q(z|x)\) and the prior \(p(z)\).

def vae_loss(recon_mu, recon_logvar, x, mu, logvar):
    # Reconstruction loss: Gaussian log likelihood
    recon_var = torch.exp(recon_logvar)
    recon_nll = 0.5 * (torch.log(2 * torch.pi * recon_var) + (x - recon_mu) ** 2 / recon_var)
    recon_loss = recon_nll.sum(dim=1).mean()  # sum over features, mean over batch

    # KL divergence: D_KL(q(z|x) || p(z)) where p(z)=N(0,I)
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    kl_loss = kl_div.mean()

    return recon_loss + kl_loss, recon_loss, kl_loss

The reconstruction loss assumes the decoder outputs a Gaussian distribution for each input dimension. The term recon_nll \(= \frac{1}{2}\left[\log(2\pi\sigma^{2}) + \frac{(x-\mu)^{2}}{\sigma^{2}}\right]\) measures how surprising the observed data \(x\) would be under the predicted mean and variance. It penalizes poor fits and rewards precise reconstructions. This is summed across features and averaged across the batch. High-fidelity reconstruction requires the decoder to learn how each latent variable shapes the original space, making the VAE a powerful model for estimating high-dimensional latent representations that emit full probability distributions over inputs.
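The per-dimension reconstruction term is just the negative log-density of a Gaussian, which we can verify against PyTorch's own distributions module. A minimal sketch with arbitrary illustrative values:

```python
import torch

# The manual term 0.5*(log(2*pi*sigma^2) + (x - mu)^2 / sigma^2)
# should equal -log N(x; mu, sigma^2) from torch.distributions.
x = torch.tensor([0.7])
mu = torch.tensor([0.2])
logvar = torch.tensor([-0.5])
var = torch.exp(logvar)

manual_nll = 0.5 * (torch.log(2 * torch.pi * var) + (x - mu) ** 2 / var)
library_nll = -torch.distributions.Normal(mu, torch.exp(0.5 * logvar)).log_prob(x)

print(manual_nll.item(), library_nll.item())  # identical up to float precision
```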

The KL divergence term acts as a regularizer. It measures how far the encoder’s learned distribution \(q(z∣x)\) is from the standard normal prior \(p(z)\). Without this term, the model could overfit—memorizing reconstructions without learning a useful latent structure. By keeping the latent codes close to \(N(0,I)\), the model remains generative: we can sample from the prior and decode into valid synthetic data. In short, KL divergence ensures generalization and interpretability, while reconstruction loss ensures fidelity. Their sum is what guides learning in the VAE.
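The closed-form KL expression used in vae_loss can be cross-checked by Monte Carlo. A minimal sketch, where kl_diag_gaussian is a name introduced here for illustration:

```python
import torch

# Closed-form KL(q || p) for diagonal Gaussian q = N(mu, sigma^2), p = N(0, I),
# matching the kl_div term in vae_loss above.
def kl_diag_gaussian(mu, logvar):
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

torch.manual_seed(0)
mu = torch.tensor([[0.5, -1.0]])
logvar = torch.tensor([[0.2, -0.3]])
std = torch.exp(0.5 * logvar)

# Monte Carlo estimate: KL = E_q[log q(z) - log p(z)]
z = mu + std * torch.randn(200_000, 2)
q = torch.distributions.Normal(mu, std)
p = torch.distributions.Normal(0.0, 1.0)
mc_kl = (q.log_prob(z) - p.log_prob(z)).sum(dim=1).mean()

closed = kl_diag_gaussian(mu, logvar)
print(closed.item(), mc_kl.item())  # the two estimates agree closely
```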

Training VAE models

At the core of training deep learning models is backpropagation, an algorithm that computes gradients of a loss function with respect to each model parameter. PyTorch handles this automatically. We separate the data into batches and compute the loss for each batch; calling loss.backward() then computes these gradients via the chain rule. The optimizer adjusts the weights with optimizer.step(), moving the model parameters in the direction that reduces the loss. Each training iteration refines the model, gradually improving its ability to encode and reconstruct the data. The train_vae function applies this cycle across epochs, using PyTorch’s autograd system to manage the entire computation graph.
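This zero_grad / backward / step cycle can be seen in isolation on a toy model. A minimal sketch, assuming a simple linear regression target rather than the VAE:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-2)

X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]])     # known linear target

for epoch in range(300):
    optimizer.zero_grad()                        # clear accumulated gradients
    loss = ((model(X) - y) ** 2).mean()          # batch loss
    loss.backward()                              # chain-rule gradients via autograd
    optimizer.step()                             # gradient step on the weights

print(round(loss.item(), 4))  # loss shrinks toward zero as the fit improves
```

train_vae below follows the same cycle, with the combined reconstruction and KL loss in place of the squared error.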

The prep_data_vae function creates simulated data, adds optional missingness, and splits it into training and test sets. Each training set—of size 500, 1,000, or 10,000—is passed to a new VAE instance for fitting. More data provides better coverage of the latent space, allowing the model to generalize more confidently. With small data, the model risks memorizing noise. With large data, KL regularization helps the decoder learn smooth, generalizable mappings from latent codes to input reconstructions. The training runs below test this sensitivity empirically, keeping architecture fixed and varying only the sample size.

Code
def prep_data_vae(sample_size=1000, missing_frac=0.0, impute=False):
    sample_df = make_sample(cov_matrix=cov_matrix, size=sample_size, columns=FEATURE_COLUMNS, missing_frac=missing_frac, impute=impute)

    X_train, X_test = train_test_split(sample_df.values, test_size=0.2, random_state=890)

    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)

    train_loader = torch.utils.data.DataLoader(X_train, batch_size=32, shuffle=True)
    test_loader = torch.utils.data.DataLoader(X_test, batch_size=32)
    return train_loader, test_loader

Training a variational autoencoder means adjusting a large array of weights to minimize reconstruction error and regularize the latent space. The train_vae function loops over batches from the training set, computes the loss, and updates the model via backpropagation. Each batch yields predicted distributions over features, from which we compute both the reconstruction loss and the KL divergence. These are averaged and logged for both training and held-out test data. Early stopping halts training when improvements plateau—preventing overfitting while preserving the best model seen so far.

def train_vae(vae, optimizer, train, test, patience=10, wait=0, n_epochs=1000):
    best_loss = float('inf')
    losses = []

    for epoch in range(n_epochs):
        vae.train()
        train_loss = 0.0
        
        for batch in train:
            optimizer.zero_grad()

            (recon_mu, recon_logvar), mu, logvar = vae(batch)
            loss, recon_loss, kl_loss = vae_loss(recon_mu, recon_logvar, batch, mu, logvar)

            loss.backward()
            optimizer.step()

            train_loss += loss.item() * batch.size(0)

        avg_train_loss = train_loss / train.dataset.shape[0]

        # --- Test Loop ---
        vae.eval()
        test_loss = 0.0
        with torch.no_grad():
            for batch in test:
                (recon_mu, recon_logvar), mu, logvar = vae(batch)
                loss, _, _ = vae_loss(recon_mu, recon_logvar, batch, mu, logvar)
                test_loss += loss.item() * batch.size(0)
        avg_test_loss = test_loss / test.dataset.shape[0]

        print(f"Epoch {epoch+1}/{n_epochs} | Train Loss: {avg_train_loss:.4f} | Test Loss: {avg_test_loss:.4f}")

        if  avg_test_loss < best_loss - 1e-4:
            best_loss, wait = avg_test_loss, 0
            best_state = vae.state_dict()
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                vae.load_state_dict(best_state)  # restore best
                break
        losses.append([avg_train_loss, avg_test_loss, best_loss])

    return vae, pd.DataFrame(losses, columns=['train_loss', 'test_loss', 'best_loss'])

We are now in a position to fit the same model architecture to different samples of the data. Training on 10,000 samples, for instance, challenges the model to encode the full covariance structure of our job satisfaction data while retaining generative flexibility. This training routine sets the stage for comparing VAEs to Bayesian models, where posterior predictive sampling plays a similar generative role—but with different assumptions and computational tradeoffs.

train_500, test_500 = prep_data_vae(500)
vae = NumericVAE(n_features=train_500.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_500, losses_df_500 = train_vae(vae, optimizer, train_500, test_500)

train_1000, test_1000 = prep_data_vae(1000)
vae = NumericVAE(n_features=train_1000.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_1000, losses_df_1000 = train_vae(vae, optimizer, train_1000, test_1000)

train_10_000, test_10_000 = prep_data_vae(10_000)
vae = NumericVAE(n_features=train_10_000.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_10_000, losses_df_10_000 = train_vae(vae, optimizer, train_10_000, test_10_000)

Training and test losses tell us how well the model fits the data—and how well it generalizes. The plots show these losses over training epochs for datasets of size 500, 1,000, and 10,000. A falling training loss indicates the model is learning to reconstruct the data. If test loss follows the same trend, generalization is improving too. If test loss flattens or rises while training loss keeps falling, the model may be overfitting—memorizing rather than learning.

Code
def plot_metric_recovery(fit_vae, test_dataset):
    recon_df = pd.DataFrame(fit_vae.generate(test_dataset.shape[0]), columns=FEATURE_COLUMNS)

    test_df = pd.DataFrame(test_dataset, columns=FEATURE_COLUMNS)

    fig, axs = plt.subplots(6, 2, figsize=(8, 20))
    axs = axs.flatten()
    for col, ax in zip(FEATURE_COLUMNS, axs):
        ax.hist(recon_df[col], color='red', alpha=0.5, ec='grey', label=f'Reconstructed Metric {col}')
        ax.hist(test_df[col], color='blue', alpha=0.5, ec='black', label=f'Test Metric {col}')
        ax.legend()

fig, axs = plt.subplots(1, 3, figsize=(8, 6))
axs=axs.flatten()
losses_df_500[['train_loss', 'test_loss']].plot(ax=axs[0])
losses_df_1000[['train_loss', 'test_loss']].plot(ax=axs[1])
losses_df_10_000[['train_loss', 'test_loss']].plot(ax=axs[2])

axs[0].set_title("Training and Test Losses \n 500 observations");
axs[1].set_title("Training and Test Losses \n 1000 observations");
axs[2].set_title("Training and Test Losses \n 10_000 observations");

These curves also validate our early stopping rule. When test loss plateaus, training halts—preserving the model that best balances fidelity and generalization. The plots confirm that this dynamic works across scales. In sum, training and test loss curves are not just diagnostics of performance—they’re windows into the model’s learning process and how data scale affects estimation.

We can also see that the models are able to reconstruct the marginal metric distributions:

plot_metric_recovery(vae_fit_10_000, test_10_000.dataset)

To evaluate how well the VAE learns the joint structure of the data, and not just feature-wise accuracy, we compare correlation matrices of the reconstructed data. We treat the known generating correlation matrix as ground truth and ask: does the VAE, when generating synthetic data, reproduce its dependency structure? For each bootstrap iteration, we sample new data from the trained VAE and compute its correlation matrix. We subtract this from the true correlation matrix, storing the residuals. Averaging across 1,000 samples gives us the expected discrepancy between the true and learned structure.

This bootstrapping approach is key. Any one synthetic sample from the VAE might show idiosyncratic deviations due to randomness in the generative process. By repeating the process many times, we average out this noise and obtain a stable, expected residual. This makes the evaluation fair: it judges the model based on the distribution it has learned, not on any one lucky—or unlucky—draw. A good model will show consistently small residuals, especially along the strongest correlations of the test set.

def bootstrap_residuals(vae_fit, X_test, n_boot=1000):
    n = len(FEATURE_COLUMNS)
    resid_array = np.zeros((n_boot, n, n))
    true_corr = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)
    for i in range(n_boot):
        recon_data = vae_fit.generate(n_samples=len(X_test))
        reconstructed_df = pd.DataFrame(recon_data, columns=FEATURE_COLUMNS)
        # Residual between the true correlation matrix and this draw's correlations
        resid_array[i] = (true_corr - reconstructed_df.corr()).values

    avg_resid = resid_array.mean(axis=0)
    bootstrapped_resids = pd.DataFrame(avg_resid, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)
    return bootstrapped_resids

bootstrapped_resids_500 = bootstrap_residuals(vae_fit_500, pd.DataFrame(test_500.dataset, columns=FEATURE_COLUMNS))

bootstrapped_resids_1000 = bootstrap_residuals(vae_fit_1000, pd.DataFrame(test_1000.dataset, columns=FEATURE_COLUMNS))

bootstrapped_resids_10_000 = bootstrap_residuals(vae_fit_10_000, pd.DataFrame(test_10_000.dataset, columns=FEATURE_COLUMNS))


fig, axs = plt.subplots(3, 1, figsize=(10, 20))
axs = axs.flatten()
plot_heatmap(bootstrapped_resids_500, title="""Expected Correlation Residuals for 500 observations \n Under 1000 Bootstrapped Reconstructions""", ax=axs[0], colorbar=True, vmin=-.25, vmax=.25)

plot_heatmap(bootstrapped_resids_1000, title="""Expected Correlation  Residuals  for 1000 observations \n Under 1000 Bootstrapped Reconstructions""", ax=axs[1], colorbar=True, vmin=-.25, vmax=.25)

plot_heatmap(bootstrapped_resids_10_000, title="""Expected Correlation  Residuals  for 10,000 observations \n Under 1000 Bootstrapped Reconstructions""", ax=axs[2], colorbar=True, vmin=-.25, vmax=.25)

The resulting heatmaps visualize the average residuals across correlation pairs. With 500 training samples, patterns are weaker and more erratic. At 1,000, the VAE begins to capture more of the joint structure. With 10,000, the residuals shrink further, especially near key variable clusters. This confirms that the VAE learns not only to reconstruct inputs but to replicate their internal relationships—an essential feature for any model claiming to approximate the data-generating process.

Missing Data and Inductive Bias

Missing data is a common challenge in survey research, especially in job satisfaction studies. Respondents often skip questions—sometimes at random, sometimes in patterns related to stress, disengagement, or privacy concerns. This creates non-response bias, where the absence of data is itself informative. Any imputation method must account not only for what’s missing, but why it’s missing. Ignoring this can distort correlation estimates and lead to misleading inferences.

sample_df_missing = sample_df.copy()

# Randomly pick 5% of the total elements
mask_remove = np.random.rand(*sample_df_missing.shape) < 0.05

# Set those elements to NaN
sample_df_missing[mask_remove] = np.nan
sample_df_missing.head()
JW1 JW2 JW3 UF1 UF2 FOR DA1 DA2 DA3 EBA ST MI
0 0.532182 0.268517 0.826548 0.899359 0.671467 0.836727 -0.707105 -0.687392 -0.033037 0.693069 1.493902 -0.020099
1 -0.391045 0.121261 -0.566634 0.642383 1.106285 0.731170 0.729951 0.889734 0.277911 0.804130 0.075732 1.646967
2 0.088972 0.478532 1.139142 0.114234 0.763116 NaN -0.364538 0.357924 -0.602773 0.935978 1.223607 2.687337
3 0.550635 0.163589 -0.595279 -0.007784 -0.647865 0.671655 0.169744 -0.216028 0.281505 -0.511853 0.346339 0.551167
4 -1.835692 -0.664557 -1.304166 0.435669 -0.543643 -0.983968 -0.060496 -0.059377 0.067324 NaN 1.247916 -0.827021

In the code above, we simulate this problem by introducing 5% random missingness into the dataset. This models a common real-world scenario: partial responses from otherwise complete surveys. While this version assumes the data are missing at random (MAR), in practice missingness may depend on latent traits, precisely what our models aim to capture. A naive mean imputation might smooth over the problem, but risks losing structural information or introducing artificial correlations. We want our deep learning architecture to handle missing data directly, so we adapt the data preparation in a prep_data_vae_missing function to accommodate it.

Code
class MissingDataDataset(Dataset):
    def __init__(self, x, mask):
        # x and mask are tensors of same shape
        self.x = x
        self.mask = mask
        
    def __len__(self):
        return self.x.shape[0]
    
    def __getitem__(self, idx):
        return self.x[idx], self.mask[idx]

def prep_data_vae_missing(sample_size=1000, batch_size=32, missing_frac=0.2, impute=False):
    sample_df = make_sample(cov_matrix=cov_matrix, size=sample_size, columns=FEATURE_COLUMNS, missing_frac=missing_frac, impute=impute)

    X_train, X_test = train_test_split(sample_df.values, test_size=0.2, random_state=890)

    # Mask: 1=observed, 0=missing
    mask_train = ~pd.DataFrame(X_train).isna()
    mask_test = ~pd.DataFrame(X_test).isna()

    # Tensors (keep NaNs for missing values)
    x_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    mask_train_tensor = torch.tensor(mask_train.values, dtype=torch.float32)

    x_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    mask_test_tensor = torch.tensor(mask_test.values, dtype=torch.float32)

    train_dataset = MissingDataDataset(x_train_tensor, mask_train_tensor)
    test_dataset = MissingDataDataset(x_test_tensor, mask_test_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    return train_loader, test_loader

Deep learning models, including VAEs, come with inductive biases—assumptions about the structure of the data that guide learning in the face of ambiguity. These biases can be helpful or harmful. In the next section, we explore how our VAE behaves when trained and applied to incomplete data. This will show both the flexibility of generative models and the risks of overconfidence and poor generalisation.

To handle missing data, we extend the VAE by allowing it to learn how to impute. The NumericVAE_missing model introduces a vector of learnable parameters—self.missing_embeddings—one per feature. During training, missing values in the input \(x\) are replaced with these parameters using torch.where, giving the model freedom to discover plausible replacements. Unlike mean imputation, this strategy is dynamic: the imputation values are tuned as part of the model’s optimization, guided by the same loss function that trains the encoder and decoder.
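Before looking at the full model, a minimal sketch (hypothetical tensors, not the post's architecture) confirms that gradients do flow into learnable imputation parameters through torch.where:

```python
import torch

# Stand-in for self.missing_embeddings: one learnable value per feature
missing_embeddings = torch.ones(3, requires_grad=True)
x = torch.tensor([[1.0, float("nan"), 2.0]])

# NaNs are replaced by the learnable parameters; observed values pass through
x_filled = torch.where(torch.isnan(x), missing_embeddings.expand_as(x), x)

loss = (x_filled ** 2).sum()
loss.backward()
# Only the imputed (NaN) position receives a gradient: d/dθ θ² = 2θ = 2
# missing_embeddings.grad is now tensor([0., 2., 0.])
```

This is what makes the strategy "dynamic": the same backward pass that trains the encoder and decoder also nudges the imputation values.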

class NumericVAE_missing(nn.Module):
    def __init__(self, n_features, hidden_dim=64, latent_dim=24, dropout_rate=0.2):
        super().__init__()
        self.n_features = n_features
        self.dropout = nn.Dropout(p=dropout_rate)

        # ---------- Learnable Imputation ----------
        # One learnable parameter per feature for missing values
        self.missing_embeddings = nn.Parameter(torch.zeros(n_features))

        # ---------- ENCODER ----------
        self.fc1_x = nn.Linear(n_features, hidden_dim)

        # Stronger mask encoder: 2-layer MLP
        self.fc1_mask = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        # Combine feature and mask embeddings
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # ---------- DECODER ----------
        self.fc2 = nn.Linear(latent_dim, hidden_dim)
        self.fc_out_mu = nn.Linear(hidden_dim, n_features)
        self.fc_out_logvar = nn.Linear(hidden_dim, n_features)

    def encode(self, x, mask):
        # Impute missing values with learnable parameters
        x_filled = torch.where(
            torch.isnan(x),
            self.missing_embeddings.expand_as(x),
            x
        )

        # Encode features and mask separately
        h_x = F.silu(self.fc1_x(x_filled))
        h_x = self.dropout(h_x)
        h_mask = self.fc1_mask(mask)

        # Combine embeddings
        h = h_x + h_mask

        # Latent space
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.silu(self.fc2(z))
        h = self.dropout(h)  # optional: decoder dropout
        recon_mu = self.fc_out_mu(h)
        # Compute the log-variance once and clamp it for numerical stability
        recon_logvar = torch.clamp(self.fc_out_logvar(h), -5, 5)
        return recon_mu, recon_logvar

    def forward(self, x, mask):
        mu, logvar = self.encode(x, mask)
        z = self.reparameterize(mu, logvar)
        recon_mu, recon_logvar = self.decode(z)
        return (recon_mu, recon_logvar), mu, logvar

    def generate(self, n_samples=100):
        self.eval()
        with torch.no_grad():
            z = torch.randn(n_samples, self.fc_mu.out_features)
            recon_mu, recon_logvar = self.decode(z)
            return recon_mu + torch.exp(0.5 * recon_logvar) * torch.randn_like(recon_mu)

The model also needs to know what was missing. For this, we pass a binary mask indicating which values are observed. This mask is processed through a two-layer MLP (self.fc1_mask), creating a learned representation of the missingness pattern. The encoded mask is then added to the encoded (and imputed) features. This design allows the encoder to distinguish between genuine low values and missing entries—and to condition the latent representation not just on the data, but on its absence.

Aside from this masking logic, the rest of the model mirrors a standard VAE: it uses a reparameterized latent space, decodes into means and variances for each feature, and generates new samples by drawing from the latent prior. Dropout layers add regularization. The overall effect is to give the model flexibility in the face of partial data. It can learn both how to fill in what’s missing and how to encode uncertainty about those guesses. This approach injects an inductive bias that missingness matters—and that learning should account for it directly.

def vae_loss_missing(recon_mu, recon_logvar, x, mu, logvar, mask):
    # Clamp log-variance for numerical stability
    recon_logvar = torch.clamp(recon_logvar, min=-5.0, max=5.0)
    recon_var = torch.exp(recon_logvar)

    # Zero out missing entries so NaNs do not propagate
    masked_x = torch.where(mask.bool(), x, torch.zeros_like(x))
    diff_sq = (masked_x - recon_mu) ** 2

    # Gaussian NLL per element, restricted to observed entries.
    # Masking the full NLL term (rather than its components) avoids
    # spurious log(0) contributions from the missing cells.
    nll = 0.5 * (torch.log(2 * torch.pi * recon_var) + diff_sq / recon_var)
    recon_nll = nll * mask

    # Reduce over features and average over batch
    obs_counts = mask.sum(dim=1).clamp(min=1)
    recon_loss = (recon_nll.sum(dim=1) / obs_counts).mean()

    # KL divergence
    kl_div = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    kl_loss = kl_div.mean()

    return recon_loss, kl_loss

To adapt the VAE’s loss function for missing data, we restrict reconstruction error to observed entries only. The function vae_loss_missing computes a masked negative log-likelihood (NLL), treating missing values as undefined and excluding them from the loss: the per-element Gaussian NLL is multiplied elementwise by the binary mask, so unobserved cells contribute nothing. The reconstruction error is thus accumulated only over the available data, avoiding any implicit assumptions about the unobserved values. This approach lets the model learn from incomplete input without penalizing it for what it can’t see.

The KL divergence term remains unchanged, regularizing the latent distribution regardless of missingness. However, the reconstruction term is adjusted for each sample by dividing by the count of observed entries (obs_counts), avoiding bias due to varying amounts of missing data across rows. Numerical stability is enforced via clamping and small epsilon terms. Together, these changes ensure the objective function reflects uncertainty where it should and remains focused on the parts of the data the model can justifiably be judged on.
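As a quick sanity check, a self-contained toy version of the masked NLL (with hypothetical inputs, not the post's batches) confirms that fully masked cells, even NaN-valued ones, contribute nothing to the loss:

```python
import torch

def masked_gaussian_nll(x, mu, logvar, mask):
    # Per-element Gaussian NLL, zeroed where the mask is 0,
    # normalized per row by the number of observed entries
    var = torch.exp(torch.clamp(logvar, -5.0, 5.0))
    x0 = torch.where(mask.bool(), x, torch.zeros_like(x))
    nll = 0.5 * (torch.log(2 * torch.pi * var) + (x0 - mu) ** 2 / var)
    return (nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

x = torch.tensor([[1.0, float("nan")], [0.5, -0.5]])
mask = torch.tensor([[1.0, 0.0], [1.0, 1.0]])
mu = torch.zeros(2, 2)
logvar = torch.zeros(2, 2)

loss = masked_gaussian_nll(x, mu, logvar, mask)
# The NaN in the masked cell of row 0 does not poison the loss
assert torch.isfinite(loss).all()
```

Row 0 is scored only on its single observed entry, and the per-row normalization keeps rows with different amounts of missingness on the same scale.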

The training loop remains largely unchanged; we fit two versions of the model, one on a sample of 5,000 data points and one on 25,000.

def fit_vae_missing(vae_missing, train_loader, test_loader, optimizer, patience=10, wait=0, n_epochs=1000):
    best_loss = float('inf')
    losses = []

    for epoch in range(n_epochs):
        vae_missing.train()
        
        train_loss = 0
        for x_batch, mask_batch in train_loader:
            optimizer.zero_grad()
            (recon_mu, recon_logvar), mu, logvar = vae_missing(x_batch, mask_batch)
            recon_loss, kl_loss = vae_loss_missing(recon_mu, recon_logvar, x_batch, mu, logvar, mask_batch)

            loss = recon_loss + kl_loss
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * x_batch.size(0)

        avg_train_loss = train_loss / len(train_loader.dataset)

        # --- Validation ---
        vae_missing.eval()
        test_loss = 0.0
        with torch.no_grad():
            for x_batch, mask_batch in test_loader:
                (recon_mu, recon_logvar), mu, logvar = vae_missing(x_batch, mask_batch)
                recon_loss, kl_loss = vae_loss_missing(recon_mu, recon_logvar, x_batch, mu, logvar, mask_batch)
                loss = recon_loss + kl_loss
                test_loss += loss.item() * x_batch.size(0)
        avg_test_loss = test_loss / len(test_loader.dataset)

        print(f"Epoch {epoch+1}/{n_epochs} | Train: {avg_train_loss:.4f} | Test: {avg_test_loss:.4f}")

        # Early stopping
        if avg_test_loss < best_loss - 1e-4:
            best_loss, wait = avg_test_loss, 0
            best_state = vae_missing.state_dict()
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                vae_missing.load_state_dict(best_state)  # restore best
                break
        losses.append([avg_train_loss, avg_test_loss, best_loss])
    losses_df = pd.DataFrame(losses, columns=['train_loss', 'test_loss', 'best_loss'])    

    return vae_missing, losses_df


train_loader_5000, test_loader_5000 = prep_data_vae_missing(5000, batch_size=32)
vae_missing_5000 = NumericVAE_missing(n_features=next(iter(train_loader_5000))[0].shape[1], latent_dim=12)
optimizer = optim.Adam(vae_missing_5000.parameters(), lr=1e-4)

fit_vae_missing_5000, losses_df_missing_5000 = fit_vae_missing(vae_missing_5000, train_loader_5000, test_loader_5000, optimizer)


train_loader_25_000, test_loader_25_000 = prep_data_vae_missing(25_000, batch_size=32)
vae_missing_25_000 = NumericVAE_missing(n_features=next(iter(train_loader_25_000))[0].shape[1], latent_dim=12)
optimizer = optim.Adam(vae_missing_25_000.parameters(), lr=1e-4)

fit_vae_missing_25_000, losses_df_missing_25_000 = fit_vae_missing(vae_missing_25_000, train_loader_25_000, test_loader_25_000, optimizer)

Plot Metric Reconstruction

The histograms below show how well the VAE reconstructs the marginal distributions of individual survey metrics. For each variable, we overlay the reconstructed values (in red) with the test set values (in blue). Visually, there’s strong overlap across most dimensions, suggesting the model has learned to approximate the correct univariate distributions—even with missing data. This kind of marginal recovery is encouraging, especially in low-data regimes or when imputing incomplete responses.

Code
plot_metric_recovery(fit_vae_missing_5000, test_loader_5000.dataset.x)

However, strong metric-wise recovery does not imply that the model has captured the joint structure of the data. Each histogram tells us how accurate the model is on a single variable, but ignores how variables co-vary—a crucial element in survey analysis where latent constructs often manifest as correlated responses. It’s entirely possible to match marginal distributions while failing to reproduce the inter-variable dependencies that define the data’s true geometry.
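The point is easy to demonstrate with a toy numpy example (illustrative numbers, not the post's survey data): independently permuting each column preserves every marginal distribution exactly while destroying the correlation structure.

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.standard_normal(5000)
# Two strongly correlated features (population correlation ~0.9)
data = np.column_stack([z, 0.9 * z + 0.44 * rng.standard_normal(5000)])

# Shuffle each column independently: marginals identical, joint destroyed
shuffled = data.copy()
for j in range(shuffled.shape[1]):
    rng.shuffle(shuffled[:, j])

corr_orig = np.corrcoef(data, rowvar=False)[0, 1]      # close to 0.9
corr_shuf = np.corrcoef(shuffled, rowvar=False)[0, 1]  # close to 0.0
```

Both datasets would produce identical overlapping histograms, yet only one preserves the joint geometry; this is exactly the failure mode that marginal plots cannot detect.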

For this reason, histogram overlap is best viewed as a necessary but insufficient indicator of VAE reconstruction quality. It confirms that the decoder isn’t collapsing or hallucinating values, but it doesn’t validate the model’s deeper understanding of the data-generating process. In the next step, we’ll examine the reconstructed correlation structure—where, despite strong marginal performance, the VAE fails to capture the multivariate dependencies accurately.

Plot Correlation Reconstruction

The heatmaps below visualize the residuals between the observed and reconstructed correlation matrices; a more stringent test of reconstruction fidelity than individual metric distributions. While the marginal histograms previously gave the impression of strong model performance, the correlation residuals tell a different story. There are consistent and structured deviations between the original and reconstructed relationships, particularly off-diagonal. These residuals are not random noise; they reveal systematic failure to capture the joint dependencies among survey metrics.

bootstrapped_resids_5000 = bootstrap_residuals(fit_vae_missing_5000, pd.DataFrame(test_loader_5000.dataset.x, columns=FEATURE_COLUMNS))

bootstrapped_resids_25_000 = bootstrap_residuals(fit_vae_missing_25_000, pd.DataFrame(test_loader_25_000.dataset.x, columns=FEATURE_COLUMNS))


fig, axs = plt.subplots(2, 1, figsize=(8, 15))
axs = axs.flatten()
plot_heatmap(bootstrapped_resids_5000, title="""Expected Correlation Residuals for 5000 observations \n Under 1000 Bootstrapped Reconstructions""", ax=axs[0], colorbar=True, vmin=-.25, vmax=.25)

plot_heatmap(bootstrapped_resids_25_000, title="""Expected Correlation Residuals for 25,000 observations \n Under 1000 Bootstrapped Reconstructions""", ax=axs[1], colorbar=True, vmin=-.25, vmax=.25)

Notably, the magnitude and pattern of these residuals remain remarkably stable across both 5,000 and 25,000 observations. If the problem were due to data scarcity, we would expect improvements with more data. Instead, this invariance points to the model architecture itself—specifically, to the inductive biases embedded in the structure of the VAE. Despite being trained on rich and complete data, the model consistently struggles to recover the true correlational geometry.

This reflects a broader challenge in deep learning. Complex architectures like VAEs encode implicit assumptions about data structure through their layers and non-linearities. These assumptions are rarely transparent. In this case, the interaction between the encoder’s masking strategy, the learnable imputations, and the decoder’s reconstruction logic appears to flatten or distort inter-variable relationships in ways that simple performance metrics can’t detect. It’s a cautionary example of how black-box models can fail silently—even when outputs look convincing on the surface.

Simple Imputation and Simple Variational Encoders

Despite the sophistication of our masked VAE architecture, the earlier results showed it struggled to recover the true correlation structure of the data. In contrast, here we demonstrate that a simpler strategy—mean imputation followed by a standard VAE—performs substantially better.

Rather than explicitly modeling missingness, we use straightforward mean imputation, replacing each missing entry with the feature-wise mean from the training set. This naive approach ignores the potential structure in the missing data, but it has one key advantage: it preserves the full input dimensionality without introducing additional model components or parameters.
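A minimal sketch of this imputation step (the mean_impute helper here is hypothetical, standing in for the impute=True branch of the post's make_sample):

```python
import numpy as np
import pandas as pd

def mean_impute(train_df, test_df):
    # Feature-wise means from the training split only, applied to both
    # splits, so no information leaks from test into train
    means = train_df.mean()
    return train_df.fillna(means), test_df.fillna(means)

train = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [0.0, 1.0, np.nan]})
test = pd.DataFrame({"A": [np.nan], "B": [2.0]})
train_imp, test_imp = mean_impute(train, test)
# Column A's training mean is 2.0, so the NaNs in A become 2.0
```

Computing the means on the training split and reusing them for the test split is the standard precaution against leakage when the downstream model is evaluated on held-out data.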

We then train a standard VAE on this imputed data. The correlation residual heatmap below shows that, unlike the masked VAE, this model accurately recovers the underlying correlation structure, even with 20% missing data. The off-diagonal residuals are smaller and less structured, indicating that the joint metric relationships are better preserved.

train_loader_imputed, test_loader_imputed = prep_data_vae(sample_size=10_000, missing_frac=0.2, impute=True)

vae = NumericVAE(n_features=train_loader_imputed.dataset.shape[1], hidden_dim=64)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
vae_fit_10_000_imputed, losses_df_10_000_imputed = train_vae(vae, optimizer, train_loader_imputed, test_loader_imputed)

bootstrapped_resids_10_000_imputed = bootstrap_residuals(vae_fit_10_000_imputed, pd.DataFrame(test_loader_imputed.dataset, columns=FEATURE_COLUMNS))
Code
plot_heatmap(bootstrapped_resids_10_000_imputed, title="""Expected Correlation Residuals for 10,000 observations \n Under 1000 Bootstrapped Reconstructions""", colorbar=True, vmin=-.25, vmax=.25);

This contrast highlights a recurring lesson in applied modeling: simple can outperform complex - especially when the complexity introduces new sources of opaque inductive bias. The masked VAE encoded specific assumptions about missingness and reconstruction that, despite being theoretically appealing, failed to generalize. In contrast, the imputation-plus-VAE approach leverages a simpler inductive bias: that missing values can be smoothed over without disrupting the overall dependency structure. In this context, that assumption proved to be not only adequate but effective.

Bayesian Inference

In contrast to the VAE, where inductive bias emerges from the architecture and training dynamics, the Bayesian model defines its assumptions transparently through the prior. Here, we specify a multivariate normal likelihood with a mean vector and a covariance matrix drawn from an LKJ prior—a structured, interpretable distribution over valid correlation matrices. The model focuses directly on estimating the joint structure we care about, rather than reconstructing data through a bottleneck.

This PyMC model embodies the Bayesian workflow: first, we define a prior over parameters (here, the mean and correlation structure of features); second, we update this prior in light of observed data via Bayes’ rule, yielding the posterior; and finally, we generate samples from the posterior predictive distribution, which reflects the uncertainty in both parameter estimates and future data. These steps—prior → posterior → predictive sampling—form a disciplined approach to inference that naturally captures parameter uncertainty.

def make_pymc_model(sample_df):
    coords = {'features': FEATURE_COLUMNS,
            'features1': FEATURE_COLUMNS ,
            'obs': range(len(sample_df))}

    with pm.Model(coords=coords) as model:
        # Priors
        mus = pm.Normal("mus", 0, 1, dims='features')
        chol, _, _ = pm.LKJCholeskyCov("chol", n=12, eta=1.0, sd_dist=pm.HalfNormal.dist(1))
        cov = pm.Deterministic('cov', pm.math.dot(chol, chol.T), dims=('features', 'features1'))

        pm.MvNormal('likelihood', mus, cov=cov, observed=sample_df.values, dims=('obs', 'features'))
        
        idata = pm.sample_prior_predictive()
        idata.extend(pm.sample(random_seed=120))
        pm.sample_posterior_predictive(idata, extend_inferencedata=True)

    return idata, model 

sample_df = make_sample(cov_matrix, 263, FEATURE_COLUMNS)

idata, model = make_pymc_model(sample_df)

While the VAE also performs generative modeling, it does so with less targeted control over latent dependencies. Its inductive bias is entangled in neural architecture choices—nonlinearities, hidden dimensions, initialization schemes—making its behavior harder to interpret. The Bayesian model, by contrast, isolates the structure of interest in the prior and expresses uncertainty explicitly through the posterior. This cleaner separation of modeling goals is a strength: for structured estimation tasks like recovering correlation matrices, simpler Bayesian formulations often outperform complex, less interpretable alternatives.

pm.model_to_graphviz(model)

An additional advantage of the Bayesian approach is that the explicit inductive bias encoded in the prior helps stabilize estimation in low-data regimes. Because we constrain the model to estimate only correlation structures consistent with the LKJ prior—i.e., valid, regularized correlation matrices—it becomes possible to recover the essential structure of the data even with a small sample size. In this case, using just 263 observations, the model still reconstructs the underlying correlation matrix with notable accuracy. This is a powerful demonstration of the bias–variance trade-off in action: by introducing structured assumptions through the prior, we reduce variance in our estimates and avoid overfitting, which would be more likely in a more flexible, under-constrained model like the VAE.
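The variance-reduction half of that trade-off can be illustrated with a crude stand-in for the LKJ prior's regularizing pull (a toy numpy example using linear shrinkage toward zero correlation; the actual Bayesian posterior performs a more principled version of this):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])

# Repeated small samples: the raw empirical correlation is noisy
ests = []
for _ in range(500):
    x = rng.multivariate_normal(np.zeros(2), cov, size=50)
    ests.append(np.corrcoef(x, rowvar=False)[0, 1])
ests = np.array(ests)

# Linear shrinkage toward zero correlation (i.e. toward the identity matrix)
lam = 0.3
shrunk = (1 - lam) * ests

# Shrinkage lowers variance across repeated samples,
# at the cost of bias toward zero
assert shrunk.var() < ests.var()
```

The shrunken estimator is biased downward, but its estimates cluster more tightly around their mean, which is exactly the mechanism by which a structured prior stabilizes small-sample inference.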

expected_corr = pd.DataFrame(az.summary(idata, var_names=['chol_corr'])['mean'].values.reshape((12, 12)), columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)

resids = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS) - expected_corr
plot_heatmap(resids)

Missing Data

One of the most elegant features of Bayesian modeling is its native ability to handle missing data through implicit imputation. In our probabilistic model, we never need to manually fill in missing values or mask them—PyMC automatically marginalizes over the missing entries in the observed data during sampling. The model treats these missing values as additional unknowns and estimates them jointly with the other parameters using the same prior and likelihood structure. This is particularly valuable in survey data settings, such as job satisfaction responses, where non-response bias is common and must be accounted for directly.

sample_df_missing = make_sample(cov_matrix, 263, FEATURE_COLUMNS, missing_frac=0.1)

idata_missing, model_missing = make_pymc_model(sample_df_missing)
Sampling: [chol, likelihood_observed, likelihood_unobserved, mus]
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mus, chol, likelihood_unobserved]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 15 seconds.
Sampling: [likelihood_observed]

Our example shows that, even with 10% missingness in a modest-sized dataset of only 263 observations, the posterior over the correlation matrix remains well-calibrated, accurately recovering the underlying covariance structure. This demonstrates the strength of Bayesian inference: its inductive bias (in this case via the LKJ prior) not only makes estimation more robust but also provides principled uncertainty quantification for both model parameters and imputed values, all within a unified framework.

pm.model_to_graphviz(model_missing)

The model structure automatically partitions the observed from the missing cells and imputes the latter as required.

expected_corr = pd.DataFrame(az.summary(idata_missing, var_names=['chol_corr'])['mean'].values.reshape((12, 12)), columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS)

resids = pd.DataFrame(corr_matrix, columns=FEATURE_COLUMNS, index=FEATURE_COLUMNS) - expected_corr
plot_heatmap(resids)

Conclusion and Key Takeaways

  • Inductive Bias Shapes Learning: The assumptions built into models, whether explicit priors or implicit architectures, directly influence what patterns are captured and how well they generalize.

  • Latent Spaces Are Tricky: Heuristics in latent space can be elusive; interpretability and validation demand careful scrutiny beyond surface-level metrics.

  • Simplicity Often Wins: Straightforward imputation paired with simpler models can outperform complex architectures when it comes to reconstructing key data structures like correlations.

  • Bayesian Frameworks Offer Clarity: Encoding domain knowledge through priors leads to cleaner, more targeted models that perform well even with limited data.

  • Implicit Handling of Missing Data: Bayesian inference naturally accommodates uncertainty and missingness without complex architectural tweaks.

  • Iterative Model Development: Effective modeling is a continuous process of refining assumptions, balancing complexity, and aligning with the data’s underlying structure.

Our exploration reveals a fundamental truth in statistical modeling: building models is not just about fitting data, but about embedding assumptions—inductive biases—that shape how we interpret and generalize from incomplete information. Complex models with rich latent spaces can mislead if their structure obscures what they truly learn. Simpler, well-targeted approaches, especially those grounded in principled Bayesian inference, often provide clearer, more reliable insight.

Ultimately, model development is an iterative art of balancing flexibility with structure, complexity with interpretability, and data with prior knowledge. Mastering this balance is key to robust inference, meaningful imputation, and trustworthy prediction in the face of uncertainty.

Citation

BibTeX citation:
@online{forde2025,
  author = {Forde, Nathaniel},
  title = {Heuristics in {Latent} {Space:} {VAEs} and {Bayesian}
    {Inference}},
  date = {2025-07-25},
  langid = {en},
  abstract = {The cost of generating new sample data can be prohibitive.
    There is a secondary but different cost which attaches to the
    “construction” of novel data. Is the cost worth paying?}
}
For attribution, please cite this work as:
Forde, Nathaniel. 2025. “Heuristics in Latent Space: VAEs and Bayesian Inference.” July 25, 2025.